1 Normalisation - Caron set

Sources: chapters on normalisation in the OSCA book and the Hemberg group materials.

1.1 Why normalise?

Systematic differences in sequencing coverage between libraries occur because of low input material, differences in cDNA capture and PCR amplification. Normalisation removes such differences so that differences between cells are not technical but biological, allowing meaningful comparison of expression profiles between cells. Normalisation and batch correction have different aims. Normalisation addresses technical differences only, while batch correction considers both technical and biological differences.

1.2 Learning objectives

  • Understand why normalisation is required
  • Understand concepts of two normalisation methods
    • deconvolution
    • SCTransform
#normPlotDirBit <- "Plots/Norm" # not use anymore

projDir <- ".." # or /home/ubuntu/CourseMaterials/scRNAseq
dirRel <- ".."
outDirBit <- "CourseMaterials"
setName <- tolower("Caron")
writeRds <- TRUE # FALSE

# create folder for plots for normalisation:
##dir.create(sprintf("%s/%s/%s", projDir, outDirBit, normPlotDirBit),
##           showWarnings = FALSE,
##recursive = TRUE)

1.3 Load object

We will load the R object created after QC.

getwd()
## [1] "/home/ubuntu/Course_Materials/scRNAseq/Scripts"
# Read object in:
dir("../CourseMaterials/Robjects/")
## [1] "Caron_filtered.rds"                "DataIntergration_all_sce.rds"     
## [3] "README.txt"                        "caron_postDeconv_5hCellPerSpl.Rds"
## [5] "caron_postQc_5hCellPerSpl.Rds"     "caron_postSct_5hCellPerSpl.Rds"
tmpFn <- "../CourseMaterials/Robjects/Caron_filtered.rds"
sce <- readRDS(tmpFn)
sce
## class: SingleCellExperiment 
## dim: 25632 46568 
## metadata(1): Samples
## assays(1): counts
## rownames(25632): ENSG00000238009 ENSG00000239945 ... ENSG00000275063
##   ENSG00000271254
## rowData names(4): ID Symbol Type Chromosome
## colnames(46568): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
##   12_TTTGTCATCAGTTGAC-1 12_TTTGTCATCTCGTTTA-1
## colData names(14): Sample Barcode ... high_Mito_percent discard
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
# PATCH names
samplesheet <- readr::read_tsv("../CourseMaterials/Data/sample_sheet.tsv")
samplesheet %>%
    as.data.frame() %>%
    DT::datatable(rownames = FALSE, options = list(dom="tpl", nrows=20))
dd <- colData(sce) %>% data.frame()
dd$SampleId <- dd$Sample
dd <- dd  %>%
  left_join(samplesheet[, c("SampleId", "SampleName")], by="SampleId") %>%
  DataFrame()
colData(sce) <- dd  

Subsample cells down to 500 per sample

setSuf <- "_5hCellPerSpl"
nbCells <- 500
# have new list of cell barcodes for each sample
sce.nz.master <- sce
vec.bc <- colData(sce.nz.master) %>%
    data.frame() %>%
    filter(!SampleId == "SRR9264351") %>%
    group_by(SampleName) %>%
    sample_n(nbCells) %>%
    pull(Barcode)

Number of cells in the sample:

table(colData(sce.nz.master)$Barcode %in% vec.bc)
## 
## FALSE  TRUE 
## 41068  5500

Subset cells from the SCE object:

tmpInd <- which(colData(sce.nz.master)$Barcode %in% vec.bc) # mind QC metrics will be wrong
sce <- sce.nz.master[,tmpInd]

Check columns data:

head(colData(sce))
## DataFrame with 6 rows and 16 columns
##        Sample            Barcode SampleGroup DatasetName       sum  detected
##   <character>        <character> <character> <character> <numeric> <integer>
## 1  SRR9264343 AAACGGGCAGTTCATG-1  ETV6-RUNX1       Caron      7839      2223
## 2  SRR9264343 AAACGGGGTTCACCTC-1  ETV6-RUNX1       Caron      9280      2764
## 3  SRR9264343 AAAGATGAGCGATGAC-1  ETV6-RUNX1       Caron      1633       865
## 4  SRR9264343 AAAGATGCAGCCAATT-1  ETV6-RUNX1       Caron      5965      1990
## 5  SRR9264343 AAAGTAGCAATGCCAT-1  ETV6-RUNX1       Caron      5620      1879
## 6  SRR9264343 AAATGCCAGGTCATCT-1  ETV6-RUNX1       Caron      7793      2815
##   subsets_Mito_sum subsets_Mito_detected subsets_Mito_percent     total
##          <numeric>             <integer>            <numeric> <numeric>
## 1              428                    11              5.45988      7839
## 2              740                    12              7.97414      9280
## 3               74                     9              4.53154      1633
## 4              402                    11              6.73931      5965
## 5              432                    12              7.68683      5620
## 6              266                    11              3.41332      7793
##   low_lib_size low_n_features high_Mito_percent   discard    SampleId
##      <logical>      <logical>         <logical> <logical> <character>
## 1        FALSE          FALSE             FALSE     FALSE  SRR9264343
## 2        FALSE          FALSE             FALSE     FALSE  SRR9264343
## 3        FALSE          FALSE             FALSE     FALSE  SRR9264343
## 4        FALSE          FALSE             FALSE     FALSE  SRR9264343
## 5        FALSE          FALSE             FALSE     FALSE  SRR9264343
## 6        FALSE          FALSE             FALSE     FALSE  SRR9264343
##     SampleName
##    <character>
## 1 ETV6-RUNX1_1
## 2 ETV6-RUNX1_1
## 3 ETV6-RUNX1_1
## 4 ETV6-RUNX1_1
## 5 ETV6-RUNX1_1
## 6 ETV6-RUNX1_1
table(colData(sce)$SampleName)
## 
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4        HHD_1        HHD_2 
##          500          500          500          500          500          500 
##      PBMMC_1      PBMMC_2      PBMMC_3      PRE-T_1      PRE-T_2 
##          500          500          500          500          500

We write the R object to ‘caron_postQc_5hCellPerSpl.Rds’.

# Write object to file
#getwd()
tmpFn <- sprintf("%s/%s/Robjects/%s_postQc%s.Rds",
         projDir, outDirBit, setName, setSuf)
saveRDS(sce, tmpFn)
# Write object to file
tmpFn <- sprintf("%s/%s/Robjects/%s_postQc%s.Rds",
         projDir, outDirBit, setName, setSuf)
sce <- readRDS(tmpFn)

1.4 Scaling normalization

In scaling normalization, the “normalization factor” is an estimate of the library size relative to the other cells. Steps usually include: computation of a cell-specific ‘scaling’ or ‘size’ factor that represents the relative bias in that cell and division of all counts for the cell by that factor to remove that bias. Assumption: any cell specific bias will affect genes the same way.

Scaling methods typically generate normalised counts-per-million (CPM) or transcripts-per-million (TPM) values that address the effect of sequencing depth. These values however typically have a variance that increases with their mean (heteroscedasticity) while most statistical methods assume a stable variance, which does not vary with the mean (homoscedasticity). A widely used ‘variance stabilising transformation’ is the log transformation (often log2). This works fine for highly expressed genes (as in bulk RNA-seq) but less so for sparse scRNA-seq data.

1.4.1 CPM

Convert raw counts to counts-per-million (CPM) for each cell by dividing counts by the library size then multiplying by 1.000.000. Mind that this does not adress compositional bias caused by highly expressed genes that are also differentially expressed betwenn cells. In scater CPMs are computed with the following code:

calc_cpm <- function (expr_mat, spikes = NULL) 
{
    norm_factor <- colSums(expr_mat[-spikes, ])
    return(t(t(expr_mat)/norm_factor)) * 10^6
}

We will use scater’s calculateCPM()

1.4.2 DESeq’s size factor

For each gene, compute geometric mean across cells. for each cell compute for each gene the ratio of its expression to its geometric mean, and derive the cell’s size factor as the median ratio across genes. Not suitable for sparse scRNA-seq data as the geometric is computed on non-zero values only. This method is also known as ‘Relative Log Expression’ (RLE) in edgeR and scater.

Example code:

calc_sf <- function (expr_mat, spikes = NULL) 
{
    geomeans <- exp(rowMeans(log(expr_mat[-spikes, ])))
    SF <- function(cnts) {
        median((cnts/geomeans)[(is.finite(geomeans) &
                geomeans > 0)])
    }
    norm_factor <- apply(expr_mat[-spikes, ], 2, SF)
    return(t(t(expr_mat)/norm_factor))
}

1.4.3 Weighted Trimmed mean of M-values

To compute weighted Trimmed mean of M-values (TMM), a given cell is chosen as a reference to use in computation for other cells. The M-values are gene-wise log2-fold changes between cells. Trimming entails the removal of the top and bottom 30% of values. The size factor is computed as the average for the remaining cells with a weight according to inverse variances. This method assumes that most genes are not differentially expressed, and the 40% lof genes left after trimming may include many zero counts.

sizeFactors(sce) <- edgeR::calcNormFactors(counts(sce), method = "TMM")

1.4.4 Library size normalization

For each cell, the library size factor is proportional to the library size such that the average size factor across cell is one.

Advantage: normalised counts are on the same scale as the initial counts.

Compute size factors:

lib.sf <- librarySizeFactors(sce)
summary(lib.sf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1193  0.4321  0.7445  1.0000  1.3085  7.7826

Size factor distribution: wide range, typical of scRNA-seq data.

#hist(log10(lib.sf), xlab="Log10[Size factor]", col='grey80')

dd <- data.frame("log10libSf"=log10(lib.sf))
ggplot(dd, aes(x=log10libSf)) + 
  geom_histogram(bins=50)

Assumption: absence of compositional bias; differential expression between two cells is balanced: upregulation in some genes is accompanied by downregulation of other genes. Not observed.

Inaccurate normalisation due to unaccounted-for composition bias affects the size of the log fold change measured between clusters, but less so the clustering itself. It is thus sufficient to identify clusters and top marker genes.

1.4.5 Deconvolution

Composition bias occurs when differential expression beteween two samples or here cells is not balanced. For a fixed library size, identical in both cells, upregulation of one gene in a cell will means fewer UMIs can be assigned to other genes, which would then appear down regulated. Even if library sizes are allowed to differ, with that for the cell with upregulation being higher, scaling normalisation will reduce normalised counts. Non-upregulated would therefore also appear downregulated.

For bulk RNA-seq, composition bias is removed by assuming that most genes are not differentially expressed between samples, so that differences in non-DE genes would amount to the bias, and used to compute size factors.

Given the sparsity of scRNA-seq data, the methods are not appropriate.

The method below increases read counts by pooling cells into groups, computing size factors within each of these groups and scaling them so they are comparable across clusters. This process is repeated many times, changing pools each time to collect several size factors for each cell, from which is derived a single value for that cell.

tmpFn <- sprintf("%s/Images/scran_Fig3.png", "..")
knitr::include_graphics(tmpFn, auto_pdf = TRUE)

rm(tmpFn)

Cluster cells then normalise.

1.4.5.1 Cluster cells

set.seed(100) # clusters with PCA from irlba with approximation
clust <- quickCluster(sce, BPPARAM=bpp) # slow with all cells.
table(clust)
## clust
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
## 436 255 202 945 481 330 555 279 288 382 277 126 247 132 156 211 198
# write to file
##tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_quickClus%s.Rds",
##       projDir, outDirBit, setName, setSuf)
##saveRDS(clust, tmpFn)
# read from file
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_quickClus%s.Rds",
         projDir, outDirBit, setName, setSuf)
clust <- readRDS(tmpFn)

1.4.5.2 Compute size factors

#deconv.sf <- calculateSumFactors(sce, cluster=clust)
sce <- computeSumFactors(sce,
             cluster = clust,
             min.mean = 0.1,
             BPPARAM = bpp)
deconv.sf <- sizeFactors(sce)

# write to file
##tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_deconvSf%s.Rds",
##       projDir, outDirBit, setName, setSuf)
##saveRDS(deconv.sf, tmpFn)
# read from file
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_deconvSf%s.Rds",
         projDir, outDirBit, setName, setSuf)
deconv.sf <- readRDS(tmpFn)
summary(deconv.sf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02495 0.38799 0.72628 1.00000 1.34389 8.48425

Plot deconvolution size factors against library size factors:

plot(lib.sf,
     deconv.sf,
     xlab="Library size factor",
     ylab="Deconvolution size factor",
     log='xy',
     pch=16,
     col=as.integer(factor(sce$SampleGroup)))
abline(a=0, b=1, col="red")
sce <- addPerFeatureQC(sce, BPPARAM = bpp) # PATCH

colData(sce)$cell_sparsity <- 1 - (colData(sce)$detected / nrow(sce))
rowData(sce)$gene_sparsity <- (100 - rowData(sce)$detected) / 100

deconvDf <- data.frame(lib.sf, deconv.sf,
            "source_name" = sce$SampleGroup,
            "sum" = sce$sum,
            "mito_content" = sce$subsets_Mito_percent,
            "cell_sparsity" = sce$cell_sparsity)
# colour by sample type
sp <- ggplot(deconvDf, aes(x=lib.sf, y=deconv.sf, col=source_name)) +
  geom_point()
# Split by sample type
sp + facet_wrap(~source_name)
# colour by cell sparsity
sp <- ggplot(deconvDf, aes(x=lib.sf, y=deconv.sf, col=cell_sparsity)) +
  geom_point()
sp

1.4.5.3 Apply size factors

For each cell, raw counts for genes are divided by the size factor for that cell and log-transformed so downstream analyses focus on genes with strong relative differences. We use scater::logNormCounts().

sce <- logNormCounts(sce) # adds logcounts
print(assayNames(sce))
## [1] "counts"    "logcounts"

1.4.5.4 Save object

sceDeconv <- sce
# write to file
tmpFn <- sprintf("%s/%s/Robjects/%s_postDeconv%s.Rds",
         projDir, outDirBit, setName, setSuf)
saveRDS(sceDeconv, tmpFn)

1.5 Exercise 1

Exercise: apply the deconvolution normalisation on a single sample: ETV6-RUNX1_1 (aka GSM3872434).

You first load the same object we loaded earlier, then select cells for SampleName ‘ETV6-RUNX1_1’. You will then cluster cells, compute and apply size factors.

1.6 SCTransform

With scaling normalisation a correlation remains between the mean and variation of expression (heteroskedasticity). This affects downstream dimensionality reduction as the few main new dimensions are usually correlated with library size. SCTransform addresses the issue by regressing library size out of raw counts and providing residuals to use as normalized and variance-stabilized expression values in downstream analysis. We will use the sctransform vignette.

counts <- counts(sce)
print(class(counts))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
print(dim(counts))
## [1] 25632  5500
colnames(counts) <- colData(sce)$Barcode

1.6.1 Inspect data

We will now calculate some properties and visually inspect the data. Our main interest is in the general trends not in individual outliers. Neither genes nor cells that stand out are important at this step, but we focus on the global trends.

Derive gene and cell attributes from the UMI matrix.

gene_attr <- data.frame(mean = rowMeans(counts), 
                        detection_rate = rowMeans(counts > 0),
                        var = apply(counts, 1, var))
gene_attr$log_mean <- log10(gene_attr$mean)
gene_attr$log_var <- log10(gene_attr$var)
rownames(gene_attr) <- rownames(counts)
cell_attr <- data.frame(n_umi = colSums(counts),
                        n_gene = colSums(counts > 0))
rownames(cell_attr) <- colnames(counts)

Gene attributes:

dim(gene_attr)
## [1] 25632     5
head(gene_attr)

Cell attributes:

dim(cell_attr)
## [1] 5500    2
head(cell_attr)

Mean-variance relationship

For the genes, we can see that up to a mean UMI count of 0 the variance follows the line through the origin with slop one, i.e. variance and mean are roughly equal as expected under a Poisson model. However, genes with a higher average UMI count show overdispersion compared to Poisson.

ggplot(gene_attr, aes(log_mean, log_var)) + 
  geom_point(alpha=0.3, shape=16) + 
  geom_density_2d(size = 0.3) +
  geom_abline(intercept = 0, slope = 1, color='red')

Mean-detection-rate relationship

In line with the previous plot, we see a lower than expected detection rate in the medium expression range. However, for the highly expressed genes, the rate is at or very close to 1.0 suggesting that there is no zero-inflation in the counts for those genes and that zero-inflation is a result of overdispersion, rather than an independent systematic bias.

# add the expected detection rate under Poisson model
x = seq(from = -3, to = 2, length.out = 1000)
poisson_model <- data.frame(log_mean = x,
                detection_rate = 1 - dpois(0, lambda = 10^x))
ggplot(gene_attr, aes(log_mean, detection_rate)) + 
  geom_point(alpha=0.3, shape=16) + 
  geom_line(data=poisson_model, color='red') +
  theme_gray(base_size = 8)

ggplot(cell_attr, aes(n_umi, n_gene)) + 
  geom_point(alpha=0.3, shape=16) + 
  geom_density_2d(size = 0.3)

1.6.2 Transformation

“Based on the observations above, which are not unique to this particular data set, we propose to model the expression of each gene as a negative binomial random variable with a mean that depends on other variables. Here the other variables can be used to model the differences in sequencing depth between cells and are used as independent variables in a regression model. In order to avoid overfitting, we will first fit model parameters per gene, and then use the relationship between gene mean and parameter values to fit parameters, thereby combining information across genes. Given the fitted model parameters, we transform each observed UMI count into a Pearson residual which can be interpreted as the number of standard deviations an observed count was away from the expected mean. If the model accurately describes the mean-variance relationship and the dependency of mean and latent factors, then the result should have mean zero and a stable variance across the range of expression.” sctransform vignette.

Estimate model parameters and transform data

The vst function estimates model parameters and performs the variance stabilizing transformation. Here we use the log10 of the total UMI counts of a cell as variable for sequencing depth for each cell. After data transformation we plot the model parameters as a function of gene mean (geometric mean).

print(dim(counts))
## [1] 25632  5500
# We use the Future API for parallel processing; set parameters here
future::plan(strategy = 'multicore', workers = 7)
options(future.globals.maxSize = 10 * 1024 ^ 3)

set.seed(44)
vst_out <- sctransform::vst(counts,
                latent_var = c('log_umi'),
                return_gene_attr = TRUE,
                return_cell_attr = TRUE,
                show_progress = FALSE)
sctransform::plot_model_pars(vst_out)

Inspect model

print(vst_out$model_str)
## [1] "y ~ log_umi"

We will look at several genes in more detail.

rowData(sce) %>%
    as.data.frame %>%
    filter(Symbol %in% c('MALAT1', 'RPL10', 'FTL'))
sctransform::plot_model(vst_out,
            counts,
            c('ENSG00000251562', 'ENSG00000147403', 'ENSG00000087086'),
            plot_residual = TRUE)

sctransform::plot_model(vst_out,
            counts,
            c('ENSG00000087086'),
            plot_residual = TRUE,
            show_nr = TRUE,
            arrange_vertical = FALSE)

Distribution of residual mean:

ggplot(vst_out$gene_attr, aes(residual_mean)) +
    geom_histogram(binwidth=0.01)

Distribution of residual variance:

ggplot(vst_out$gene_attr, aes(residual_variance)) +
    geom_histogram(binwidth=0.1) +
    geom_vline(xintercept=1, color='red') +
    xlim(0, 10)

Variance against mean (residuals):

ggplot(vst_out$gene_attr, aes(x=residual_mean, y=residual_variance)) +
    geom_point(alpha=0.3, shape=16) + 
    #xlim(0, 0.5) +
    #ylim(0, 10) +
    xlim(0, quantile(vst_out$gene_attr$residual_mean, probs=0.95)) +
    ylim(0, quantile(vst_out$gene_attr$residual_variance, probs=0.95)) +
    geom_density_2d()

Variance against mean (genes):

ggplot(vst_out$gene_attr,
       aes(log10(gmean), residual_variance)) +
       geom_point(alpha=0.3, shape=16) +
       geom_density_2d(size = 0.3)

Check genes with large residual variance:

dd <- vst_out$gene_attr %>%
    arrange(-residual_variance) %>%
    slice_head(n = 22) %>%
    mutate(across(where(is.numeric), round, 2))

dd %>% tibble::rownames_to_column("ID") %>%
    left_join(as.data.frame(rowData(sce))[,c("ID", "Symbol")], "ID") %>%
    DT::datatable(rownames = FALSE)
# write to file
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_vst_out%s.Rds",
         projDir, outDirBit, setName, setSuf)
saveRDS(vst_out, tmpFn)

Check transformed values:

print(dim(vst_out$y))
## [1] 16875  5500
vst_out$y[1:10,1:5]
##                 AAACGGGCAGTTCATG-1 AAACGGGGTTCACCTC-1 AAAGATGAGCGATGAC-1
## ENSG00000237491        -0.17033439        -0.18173720        -0.09000010
## ENSG00000225880        -0.14952801        -0.15937263        -0.07997930
## ENSG00000230368        -0.17581790        -0.18762782        -0.09265018
## ENSG00000230699        -0.04802854        -0.05099038        -0.02729004
## ENSG00000188976        -0.52397723         0.80893827        -0.26769409
## ENSG00000187961        -0.07276045        -0.07745335        -0.03988471
## ENSG00000272512        -0.04521316        -0.04791950        -0.02612314
## ENSG00000188290        -0.49534924        -0.52866011        -0.25109949
## ENSG00000187608        -0.77154242        -0.07519498        -0.42880474
## ENSG00000188157        -0.08320715        -0.08857060        -0.04542483
##                 AAAGATGCAGCCAATT-1 AAAGTAGCAATGCCAT-1
## ENSG00000237491        -0.15305167        -0.14947673
## ENSG00000225880        -0.13460346        -0.13151510
## ENSG00000230368        -0.15791773        -0.15421520
## ENSG00000230699        -0.04357104        -0.04265208
## ENSG00000188976        -0.47052508        -0.45930017
## ENSG00000187961        -0.06568083        -0.06421988
## ENSG00000272512        -0.04113262        -0.04029012
## ENSG00000188290        -0.44389981        -0.43313499
## ENSG00000187608        -0.70699232        -0.69295685
## ENSG00000188157        -0.07509633        -0.07342004
sce
## class: SingleCellExperiment 
## dim: 25632 5500 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(25632): ENSG00000238009 ENSG00000239945 ... ENSG00000275063
##   ENSG00000271254
## rowData names(7): ID Symbol ... detected gene_sparsity
## colnames: NULL
## colData names(18): Sample Barcode ... sizeFactor cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
print(assayNames(sce))
## [1] "counts"    "logcounts"

Genes that are expressed in fewer than 5 cells are not used and not returned, so to add vst_out$y as an assay we need to remove the missing genes.

# genes that are expressed in fewer than 5 cells are not used and not returned
# so to add vst_out$y as an assay we need to ditch the missing genes completely.
# https://github.com/ChristophH/sctransform/issues/27

sceOrig <- sce
sceOrig
## class: SingleCellExperiment 
## dim: 25632 5500 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(25632): ENSG00000238009 ENSG00000239945 ... ENSG00000275063
##   ENSG00000271254
## rowData names(7): ID Symbol ... detected gene_sparsity
## colnames: NULL
## colData names(18): Sample Barcode ... sizeFactor cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
tmpInd <- which(rownames(sce) %in% rownames(vst_out$y))
cols.meta <- colData(sceOrig)
rows.meta <- rowData(sceOrig)

new.counts <- counts(sceOrig)[tmpInd, ]
sce <- SingleCellExperiment(list(counts=new.counts))

# reset the column data on the new object
colData(sce) <- cols.meta
rowData(sce) <- rows.meta[tmpInd, ]
assayNames(sce)
## [1] "counts"
sce
## class: SingleCellExperiment 
## dim: 16875 5500 
## metadata(0):
## assays(1): counts
## rownames(16875): ENSG00000237491 ENSG00000225880 ... ENSG00000276345
##   ENSG00000271254
## rowData names(7): ID Symbol ... detected gene_sparsity
## colnames: NULL
## colData names(18): Sample Barcode ... sizeFactor cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
vstMat <- as(vst_out$y[rownames(sce),], "dgCMatrix")
all(colnames(vstMat) == sce$Barcode)
## [1] TRUE
colnames(vstMat) <- NULL
assay(sce, "sctrans_norm") <- vstMat # as(vst_out$y[rownames(sce),], "dgCMatrix")
#assayNames(sce)

1.6.3 Save SCE object

# write to file
tmpFn <- sprintf("%s/%s/Robjects/%s_postSct%s.Rds",
         projDir, outDirBit, setName, setSuf)
saveRDS(sce, tmpFn)

1.7 Exercise 2

Exercise: apply the SCTransform normalisation on a single sample: ETV6-RUNX1_1 (aka GSM3872434).

In exercise 1, you have made a new SCE object with cells for SampleName ‘ETV6-RUNX1_1’. You will now inspect the mean-variance relationship and apply SCTransform to that data.

1.8 Extended materials - Visualisation

1.8.1 log raw counts

typeNorm <- "logRaw"

# approximate SVD with irlba
# irlba == implicitly restarted Lanczos bidiagonalization algorithm.
options(BiocSingularParam.default=IrlbaParam())

# Have rae counts on log2 scale:
assay(sce, "logcounts_raw") <- log2(counts(sce) + 1)

# Perform PCA:
set.seed(123)
tmp <- runPCA(
  sce,
  exprs_values = "logcounts_raw"#,
  #BSPARAM=IrlbaParam(),
  #BSPARAM=RandomParam(),
  #BPPARAM=bpp
)

PCA plot for the ‘logRaw’ counts in the caron set.

p <- plotPCA(
    tmp,
    colour_by = "SampleName",
    size_by = "sum",
    shape_by = "SampleGroup"
) + ggtitle(sprintf("PCA plot for log raw counts: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sPca.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sPca.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)
p <- plotRLE(
    #tmp[,1:10],
    tmp,
    exprs_values = "logcounts_raw",
    colour_by = "SampleName"
) + ggtitle(sprintf("RLE plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sRle.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sRle.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

1.8.2 log CPM

typeNorm <- "logCpm"

assay(sce, "logCpm") <- log2(calculateCPM(sce, size_factors = NULL)+1)

logCpmPca <- runPCA(
  sce,
  exprs_values = "logCpm"#,
  #BPPARAM=bpp
)

PCA plot for the ‘logCpm’ counts in the caron set.

p <- plotPCA(
    logCpmPca,
    colour_by = "SampleName",
    size_by = "sum",
    shape_by = "SampleGroup"
) + ggtitle(sprintf("PCA plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sPca.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sPca.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)
p <- plotRLE(
    sce,
    exprs_values = "logCpm",
    colour_by = "SampleName"
) + ggtitle(sprintf("RLE plot: %s", typeNorm))

p
#p + coord_cartesian(ylim=c(-0.5,0.5))

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sRle.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sRle.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

1.8.3 scran

Normalised counts are stored in the ‘logcounts’ assay

typeNorm <- "scran"

# assay(sce, "logcounts")

scranPca <- runPCA(
  sceDeconv,
  exprs_values = "logcounts"#,
  #BPPARAM=bpp
)

PCA plot for the ‘scran’ counts in the caron set.

p <- plotPCA(
    scranPca,
    colour_by = "SampleName",
    size_by = "sum",
    shape_by = "SampleGroup"
) + ggtitle(sprintf("PCA plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sPca.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sPca.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

TSNE plot for the ‘scran’ counts in the caron set.

typeNorm <- "scran"

reducedDim(sceDeconv, "TSNE_scran") <- reducedDim(
  runTSNE(sceDeconv,
      exprs_values = "logcounts",
      BPPARAM=bpp),
  "TSNE"
)
p <- plotReducedDim(
  sceDeconv,
  dimred = "TSNE_scran",
  colour_by = "SampleName",
  size_by = "sum",
  shape_by = "SampleGroup"
) + ggtitle(sprintf("TSNE plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sTsne.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sTsne.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

UMAP plot for the ‘scran’ counts in the caron set.

typeNorm <- "scran"

reducedDim(sceDeconv, "UMAP_scran") <- reducedDim(
  runUMAP(sceDeconv,
      exprs_values = "logcounts",
      BPPARAM=bpp),
  "UMAP"
)
p <- plotReducedDim(
  sceDeconv,
  dimred = "UMAP_scran",
  colour_by = "SampleName",
  size_by = "sum",
  shape_by = "SampleGroup"
) + ggtitle(sprintf("UMAP plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sUmap.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sUmap.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)
p <- plotRLE(
    scranPca,
    exprs_values = "logcounts",
    colour_by = "SampleName"
) + ggtitle(sprintf("RLE plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sRle.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sRle.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

1.8.4 SCTransform

typeNorm <- "sctrans"

reducedDim(sce, "PCA_sctrans_norm") <- reducedDim(
  runPCA(sce,
     exprs_values = "sctrans_norm"#,
     #BPPARAM=bpp
     ),
  "PCA"
)

PCA plot for the ‘sctrans’ counts in the caron set.

p <- plotReducedDim(
  sce,
  dimred = "PCA_sctrans_norm",
  colour_by = "SampleName",
  size_by = "sum",
  shape_by = "SampleGroup"
) + ggtitle(sprintf("PCA plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sPca.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sPca.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

TSNE plot for the ‘sctrans’ counts in the caron set.

typeNorm <- "sctrans"

reducedDim(sce, "TSNE_sctrans_norm") <- reducedDim(
  runTSNE(sce,
      exprs_values = "sctrans_norm",
      BPPARAM=bpp),
  "TSNE"
)
p <- plotReducedDim(
  sce,
  dimred = "TSNE_sctrans_norm",
  colour_by = "SampleName",
  size_by = "sum",
  shape_by = "SampleGroup"
) + ggtitle(sprintf("TSNE plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sTsne.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sTsne.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

UMAP plot for the ‘sctrans’ counts in the caron set.

typeNorm <- "sctrans"

reducedDim(sce, "UMAP_sctrans_norm") <- reducedDim(
  runUMAP(sce,
      exprs_values = "sctrans_norm",
      BPPARAM=bpp),
  "UMAP"
)
p <- plotReducedDim(
  sce,
  dimred = "UMAP_sctrans_norm",
  colour_by = "SampleName",
  size_by = "sum",
  shape_by = "SampleGroup"
) + ggtitle(sprintf("UMAP plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sUmap.png",
##       projDir, outDirBit, qcPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sUmap.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

Cell-wise RLE for the ‘sctrans’ counts in the caron set. Each cell is represented by a box plot showing the inter-quartile range in grey, wiskers colour-coded by Sample.Name and the median as a black circle.

p <- plotRLE(
    sce,
    exprs_values = "sctrans_norm",
    colour_by = "SampleName"
) + ggtitle(sprintf("RLE plot: %s", typeNorm))

p

# write plot to file:
##tmpFn <- sprintf("%s/%s/%s/%s_sce_nz_postQc%s_%sRle.png",
##       projDir, outDirBit, normPlotDirBit, setName, setSuf, typeNorm)
##ggsave(filename=tmpFn, plot=p, type="cairo-png")
tmpFn <- sprintf("%s/%s/%s_sce_nz_postQc%s_%sRle.png",
         dirRel, qcPlotDirBit, setName, setSuf, typeNorm)
knitr::include_graphics(tmpFn, auto_pdf = TRUE)
rm(tmpFn)

1.9 Session information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
##  [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
##  [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
## [10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BiocParallel_1.26.0         BiocSingular_1.8.0         
##  [3] dplyr_1.0.6                 scran_1.20.1               
##  [5] scater_1.20.0               ggplot2_3.3.3              
##  [7] scuttle_1.2.0               SingleCellExperiment_1.14.1
##  [9] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [11] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0        
## [13] IRanges_2.26.0              S4Vectors_0.30.0           
## [15] BiocGenerics_0.38.0         MatrixGenerics_1.4.0       
## [17] matrixStats_0.59.0          knitr_1.33                 
## 
## loaded via a namespace (and not attached):
##  [1] Rtsne_0.15                ggbeeswarm_0.6.0         
##  [3] colorspace_2.0-1          ellipsis_0.3.2           
##  [5] bluster_1.2.1             XVector_0.32.0           
##  [7] BiocNeighbors_1.10.0      rstudioapi_0.13          
##  [9] listenv_0.8.0             farver_2.1.0             
## [11] DT_0.18                   RSpectra_0.16-0          
## [13] fansi_0.5.0               codetools_0.2-18         
## [15] sparseMatrixStats_1.4.0   jsonlite_1.7.2           
## [17] cluster_2.1.2             png_0.1-7                
## [19] uwot_0.1.10               sctransform_0.3.2        
## [21] readr_1.4.0               compiler_4.1.0           
## [23] dqrng_0.3.0               assertthat_0.2.1         
## [25] Matrix_1.3-3              limma_3.48.0             
## [27] cli_2.5.0                 htmltools_0.5.1.1        
## [29] tools_4.1.0               rsvd_1.0.5               
## [31] igraph_1.2.6              gtable_0.3.0             
## [33] glue_1.4.2                GenomeInfoDbData_1.2.6   
## [35] reshape2_1.4.4            Rcpp_1.0.6               
## [37] jquerylib_0.1.4           vctrs_0.3.8              
## [39] crosstalk_1.1.1           DelayedMatrixStats_1.14.0
## [41] xfun_0.23                 stringr_1.4.0            
## [43] globals_0.14.0            beachmat_2.8.0           
## [45] lifecycle_1.0.0           irlba_2.3.3              
## [47] statmod_1.4.36            future_1.21.0            
## [49] edgeR_3.34.0              zlibbioc_1.38.0          
## [51] MASS_7.3-54               scales_1.1.1             
## [53] hms_1.1.0                 yaml_2.2.1               
## [55] gridExtra_2.3             sass_0.4.0               
## [57] stringi_1.6.2             highr_0.9                
## [59] ScaledMatrix_1.0.0        rlang_0.4.11             
## [61] pkgconfig_2.0.3           bitops_1.0-7             
## [63] evaluate_0.14             lattice_0.20-44          
## [65] purrr_0.3.4               htmlwidgets_1.5.3        
## [67] labeling_0.4.2            cowplot_1.1.1            
## [69] tidyselect_1.1.1          RcppAnnoy_0.0.18         
## [71] parallelly_1.25.0         plyr_1.8.6               
## [73] magrittr_2.0.1            R6_2.5.0                 
## [75] generics_0.1.0            metapod_1.0.0            
## [77] DelayedArray_0.18.0       DBI_1.1.1                
## [79] pillar_1.6.1              withr_2.4.2              
## [81] RCurl_1.98-1.3            tibble_3.1.2             
## [83] future.apply_1.7.0        crayon_1.4.1             
## [85] utf8_1.2.1                rmarkdown_2.8            
## [87] viridis_0.6.1             locfit_1.5-9.4           
## [89] grid_4.1.0                isoband_0.2.4            
## [91] digest_0.6.27             munsell_0.5.0            
## [93] beeswarm_0.4.0            viridisLite_0.4.0        
## [95] vipor_0.4.5               bslib_0.2.5.1